[Dev][feat] Support CUDA Graph capture offloading modules #3219
lhb8125 wants to merge 103 commits into NVIDIA:dev
Conversation
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
/ok to test 0200121
> Fine-grained offloading is compatible with CUDA graphs. When CUDA graph is enabled, the following constraints apply:
>
> - `attn_norm` and `mlp_norm` **cannot** be offloaded (they cross CUDA graph boundaries).
> - `cuda_graph_scope` must include `attn` and `moe_router`.
Can I use the `moe` scope if I'm using a drop-and-pad MoE?
Can I offload attention-part modules if my CUDA graph scope is only `moe_router`? This may be needed since some cases have dynamic-shaped attention, so only the router part can be captured.
I removed this hard limitation; the scope can now be `moe_router` alone or `moe`.
> Fine-grained offloading is compatible with CUDA graphs. When CUDA graph is enabled, the following constraints apply:
>
> - `attn_norm` and `mlp_norm` **cannot** be offloaded (they cross CUDA graph boundaries).
Unless using the `moe` CUDA graph scope in a drop-and-pad or sync-free MoE.
What if we only capture `moe_router` or `moe_preprocess`? Is it still true?
I think so. If we only capture `moe_router`, `mlp_norm` acts as the input buffer of the graph, so it is not offloadable. The only exception is when we use the `attn`+`moe` scope for drop-and-pad MoE; then `mlp_norm` is entirely inside the graph, so it is offloadable.
BTW, you cannot capture only `moe_preprocess`; `moe_preprocess` must go together with `moe_router`.
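The scope rules discussed in this thread could be sketched as a small validation helper. This is a hypothetical function (the PR's actual check may look different); the rules themselves come from the comments above: `moe_preprocess` must be captured together with `moe_router`, and `mlp_norm` is only offloadable when it lies entirely inside the graph (e.g. an `attn`+`moe` scope for drop-and-pad MoE).

```python
def validate_cuda_graph_scope(scope, offloaded_modules):
    """Hypothetical sketch of the scope rules from this review thread."""
    scope = set(scope)
    # moe_preprocess cannot be captured alone; it must go with moe_router.
    if "moe_preprocess" in scope and "moe_router" not in scope:
        raise ValueError(
            "moe_preprocess must be captured together with moe_router"
        )
    # mlp_norm is offloadable only when fully inside the graph; with a
    # narrower scope it acts as the graph's input buffer.
    mlp_norm_inside_graph = {"attn", "moe"} <= scope
    if "mlp_norm" in offloaded_modules and not mlp_norm_inside_graph:
        raise ValueError(
            "mlp_norm crosses the CUDA graph boundary and is not offloadable"
        )

# With moe_router alone, offloading mlp_norm is rejected:
try:
    validate_cuda_graph_scope(["moe_router"], ["mlp_norm"])
except ValueError as exc:
    print(exc)

# With attn+moe (drop-and-pad MoE), mlp_norm is fully inside the graph:
validate_cuda_graph_scope(["attn", "moe"], ["mlp_norm"])
```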
2. Remove `flush_delayed_groups()` when the training is not in replay mode.
/ok to test b481fa9
/ok to test ce84682
> 3. **Apply fraction**: Only a fraction of eligible groups are actually offloaded (controlled by `activation_offload_fraction`).
> 4. **Print summary table**: An ASCII table of per-rank offload bytes is printed for debugging.
>
> ### CPU Tensor Pool
It's indeed a CPU tensor pool, which reuses the CPU tensors in the pool to avoid repeated `cudaMallocHost` calls, since host allocation is not supported during CUDA graph capture.
GPU tensors are allocated and freed on demand from the PyTorch memory pool.
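A minimal, torch-free sketch of the reuse pattern described here (class and method names are assumptions; the real pool manages pinned CPU tensors rather than `bytearray`s):

```python
class CpuBufferPool:
    """Hypothetical sketch of a CPU buffer pool keyed by size.

    Released buffers are kept and handed out again instead of being
    reallocated, mimicking how the real pool reuses pinned CPU tensors
    to avoid cudaMallocHost calls that cannot run during CUDA graph
    capture. GPU tensors, by contrast, come from the PyTorch caching
    allocator on demand.
    """

    def __init__(self):
        self._free = {}  # nbytes -> list of reusable buffers

    def acquire(self, nbytes):
        bufs = self._free.get(nbytes)
        if bufs:
            return bufs.pop()     # reuse an existing buffer
        return bytearray(nbytes)  # allocate only on a pool miss

    def release(self, buf):
        self._free.setdefault(len(buf), []).append(buf)


pool = CpuBufferPool()
a = pool.acquire(1024)
pool.release(a)
b = pool.acquire(1024)
print(b is a)  # → True: the second acquire reuses the released buffer
```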
> ### Warmup and Adaptive Offloading
>
> The first training iteration serves as a **warmup phase** where the manager records tensor groups, their sizes, and the execution order. After warmup, a `post_warmup_callback` runs to:
So we cannot capture CUDA graphs on the first training iteration? If so, we should assert `cuda_graph_warmup_steps > 0` when offloading is enabled.
Yes, the assertion was added but removed by accident. Let me add it back.
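The check agreed on here might look like the following sketch. The config attribute names are assumptions pieced together from the thread, not the PR's actual flag names:

```python
from types import SimpleNamespace


def validate_offload_config(config):
    # Hypothetical config object; attribute names follow the thread above.
    # The first training iteration is the offload manager's warmup phase,
    # so CUDA graph capture must be deferred past it.
    if config.fine_grained_activation_offloading and config.enable_cuda_graph:
        assert config.cuda_graph_warmup_steps > 0, (
            "CUDA graph capture must not happen on the first iteration: "
            "it serves as the activation-offload warmup phase"
        )


cfg = SimpleNamespace(
    fine_grained_activation_offloading=True,
    enable_cuda_graph=True,
    cuda_graph_warmup_steps=1,
)
validate_offload_config(cfg)  # passes: capture is deferred past warmup
```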
/claude review
> `# This is to avoid the CPU overhead of multiple d2h copies`
> `- if self.offload_expert_fc1:`
> `+ if self.offload_expert_fc1 and not self.config.fp8:`
Anything special about fp8?
This was to avoid multiple d2h copies, but it also doubles the number of bytes offloaded, so it's a tradeoff. Since we can delay the offloading until after graph replay, we can disable `save_original_input` by default.
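The "delay the offloading until after graph replay" idea can be sketched without CUDA as a queue that defers device-to-host copies while a graph is replaying and flushes them afterwards. All names here are hypothetical illustrations; the real manager issues asynchronous d2h copies of tensor groups:

```python
class DelayedOffloader:
    """Hypothetical sketch of deferred activation offloading.

    During graph replay no host-side copy is issued (the captured graph
    must stay free of host work); groups are queued instead and copied
    out once flush_delayed_groups() runs after the replay completes.
    """

    def __init__(self):
        self._pending = []

    def offload(self, group, in_replay):
        if in_replay:
            self._pending.append(group)  # defer the d2h copy
        else:
            self._copy_to_host(group)    # eager path outside replay

    def flush_delayed_groups(self):
        # Called after graph replay completes.
        for group in self._pending:
            self._copy_to_host(group)
        self._pending.clear()

    def _copy_to_host(self, group):
        # Stand-in for an async device-to-host copy.
        group["on_host"] = True


g1, g2 = {"on_host": False}, {"on_host": False}
off = DelayedOffloader()
off.offload(g1, in_replay=True)   # deferred until the flush
off.offload(g2, in_replay=False)  # copied immediately
off.flush_delayed_groups()
print(g1["on_host"], g2["on_host"])  # → True True
```

This also matches the commit above that removes `flush_delayed_groups()` calls when training is not in replay mode: outside replay the eager path already performed the copy, so there is nothing queued to flush.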
/ok to test a6e16a9